Llama
Papers:
- Feb 2023: LLaMA: Open and Efficient Foundation Language Models
- July 2023: Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2
- Tokenizer: byte-pair encoding (BPE)
- Trained on a dataset of 2 trillion tokens
- Architecture: Transformer
- pre-normalization with RMSNorm
- SwiGLU activation function
- Rotary Positional Embedding
- KV Cache
- Training:
- AdamW optimizer with a cosine learning rate schedule: a warm-up period of 2000 steps, then decay of the final learning rate to 10% of the peak learning rate (see the sketch after this list).
- Weight decay of 0.1 and gradient clipping.
- Fine-tuning:
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning with Human Feedback (RLHF), which uses:
- Proximal Policy Optimization (PPO)
- Rejection Sampling fine-tuning
- Ghost Attention (GAtt)
- Meta acknowledged and addressed the issue of context loss in multi-turn conversations with the Ghost Attention (GAtt) method.
- The method artificially concatenates the instruction to all user messages in the conversation.
- Meta then used the latest RLHF (Reinforcement Learning with Human Feedback) model to sample from this augmented data. The resulting context-rich dialogues were used to fine-tune the model, somewhat similar in spirit to Rejection Sampling. The outcome was better attention to the instruction across turns than the existing model. Note that this approach was evaluated specifically on the 70B models.
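Below is a minimal sketch of the cosine learning-rate schedule with warm-up described in the Training bullets above. The function name and the total-step count are illustrative assumptions; only the 2000 warm-up steps, the 10% decay target, and the peak learning rate come from these notes.

```python
import math

def cosine_lr(step: int, peak_lr: float = 3.0e-4,
              warmup_steps: int = 2000, total_steps: int = 500_000) -> float:
    """Linear warm-up to peak_lr, then cosine decay down to 10% of peak_lr.

    `total_steps` is an assumed placeholder for the length of training,
    not a number reported for Llama 2.
    """
    min_lr = 0.1 * peak_lr
    if step < warmup_steps:
        return peak_lr * step / warmup_steps              # linear warm-up
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine
```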
Normalization Layer: RMSNorm
Let's take hidden layers L1 and L2. Both learn from their input data and their weights.
Need:
- L2 learns from the output of L1 (together with L2's own weights).
- The output of L1 depends on the input data and on L1's weights, so if it is not normalized it can take on very large or very small values.
- L2 then has to learn from inputs of any range, which makes its learning difficult and slow.
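RMSNorm (the pre-normalization layer Llama uses) addresses this by rescaling each feature vector by its root mean square and applying a learned per-feature gain \(g_i\); the small \(\epsilon\) is only for numerical stability:

\[
\text{RMSNorm}(x)_i = \frac{x_i}{\text{RMS}(x)}\, g_i,
\qquad
\text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{j=1}^{n} x_j^{2} + \epsilon}
\]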
Shapes:
- \(output = X W^T + b\)
- Shape of \(X = (batch\_size, n\_features) = (10, 512)\)
- Shape of \(W = (n\_neurons, n\_features) = (5, 512)\) → \(W^T = (512, 5)\)
- Shape of \(output = (batch\_size, n\_neurons) = (10, 5)\) → these 5 neurons become the features for the next layer.
- Shape of \(b = (n\_neurons) = (5,)\) → one bias per neuron, broadcast to (10, 5) (i.e., to all 10 samples).
(Yet to complete the above example with the RMSNorm layer; a sketch of that completion follows below.)
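Below is a minimal NumPy sketch of that completion, reusing the (10, 512) input and the (10, 5) layer output from the shape example above. The \(\epsilon\) value and the all-ones gain initialization are illustrative assumptions.

```python
import numpy as np

batch_size, n_features, n_neurons = 10, 512, 5

# Linear layer from the shape example: output = X @ W.T + b
X = np.random.randn(batch_size, n_features)   # (10, 512)
W = np.random.randn(n_neurons, n_features)    # (5, 512)
b = np.zeros(n_neurons)                       # (5,), broadcast to (10, 5)
out = X @ W.T + b                             # (10, 5)

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the feature dimension: x / RMS(x) * gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)  # (10, 1)
    return x / rms * gain                                         # (10, 5)

g = np.ones(n_neurons)        # learned per-feature gain, initialized to 1
normed = rms_norm(out, g)     # (10, 5): same shape, rescaled per sample
print(normed.shape)           # (10, 5)
```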
Transformer vs Llama
| | Transformer | Llama 2 |
|---|---|---|
| Norm Layer | Layer Norm | RMSNorm |
| Order of layers | Attention --> Norm | Norm --> Attention |
| Position Encoding | Sinusoidal | Rotary |
| Activation | ReLU | SwiGLU |
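Since the table swaps ReLU for SwiGLU, here is a minimal NumPy sketch of a SwiGLU feed-forward block (gate, up, and down projections). The weight initialization and the hidden size are illustrative assumptions; Llama shrinks the usual 4·d_model FFN width by roughly 2/3 to offset the extra projection matrix.

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(x @ w_gate) * (x @ w_up) )."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model, d_hidden = 512, 1536                      # illustrative sizes
x = np.random.randn(10, d_model)                   # (batch, d_model)
w_gate = 0.02 * np.random.randn(d_model, d_hidden)
w_up   = 0.02 * np.random.randn(d_model, d_hidden)
w_down = 0.02 * np.random.randn(d_hidden, d_model)
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)   # (10, 512)
```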
Comparison of Llama 1, Llama 2, and Original Transformer Architectures
| Model | Model Size | Dimension | n Heads | n Layers | Learning Rate | Batch Size | n Tokens | Context Length |
|---|---|---|---|---|---|---|---|---|
| Llama 1 | 7B | 4096 | 32 | 32 | 3.0e-4 | 4M | 1.0T | 2k |
| | 13B | 5120 | 40 | 40 | 3.0e-4 | 4M | 1.0T | 2k |
| | 33B | 6656 | 52 | 60 | 1.5e-4 | 4M | 1.4T | 2k |
| | 65B | 8192 | 64 | 80 | 1.5e-4 | 4M | 1.4T | 2k |
| Original Transformer | Base (65M) | 512 | 8 | 6 | | | | |
| | Big (213M) | 1024 | 16 | 6 | | | | |
| Llama 2 | 7B | | | | 3.0e-4 | | 2.0T | 4k |
| | 13B | | | | 3.0e-4 | | 2.0T | 4k |
| | 34B | | | | 1.5e-4 | | 2.0T | 4k |
| | 70B | | | | 1.5e-4 | | 2.0T | 4k |
* n Tokens: Number of tokens in the training dataset